<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Elections</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="rep.html" title="Chapter 13.  Berkeley DB Replication" />
    <link rel="prev" href="rep_mgr_ack.html" title="Choosing a Replication Manager acknowledgement policy" />
    <link rel="next" href="rep_mastersync.html" title="Synchronizing with a master" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 12.1.6.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Elections</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="rep_mgr_ack.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 13.  Berkeley DB Replication </th>
          <td width="20%" align="right"> <a accesskey="n" href="rep_mastersync.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="rep_elect"></a>Elections</h2>
          </div>
        </div>
      </div>
      <p>
        Replication Manager automatically conducts elections when
        necessary, based on configuration information supplied to the
        <a href="../api_reference/C/reppriority.html" class="olink">DB_ENV-&gt;rep_set_priority()</a> method, unless the application turns off
        automatic elections using the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV-&gt;rep_set_config()</a> method.
    </p>
      <p>
        It is the responsibility of a Base API application to
        initiate elections if desired. It is never dangerous to hold
        an election, as the Berkeley DB election process ensures there
        is never more than a single master database environment.
        Clients should initiate an election whenever they lose contact
        with the master environment, whenever they see a return of
        <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_HOLDELECTION" class="olink">DB_REP_HOLDELECTION</a> from the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV-&gt;rep_process_message()</a> method, or when,
        for whatever reason, they do not know who the master is. It is
        not necessary for applications to immediately hold elections
        when they start, as any existing master will be discovered
        after calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV-&gt;rep_start()</a>. If no master has been found after a
        short wait period, then the application should call for an
        election.
    </p>
      <p>
        For a client to win an election, the replication group must
        currently have no master, and the client must have the most
        recent log records. In the case of clients having equivalent
        log records, the priority of the database environments
        participating in the election will determine the winner. The
        application specifies the minimum number of replication group
        members that must participate in an election for a winner to
        be declared. We recommend at least ((N/2) + 1) members. If
        fewer than the simple majority are specified, a warning will
        be given.
    </p>
      <p>
        If an application's policy for what site should win an
        election can be parameterized in terms of the database
        environment's information (that is, the number of sites,
        available log records and a relative priority are all that
        matter), then Berkeley DB can handle all elections
        transparently. However, there are cases where the application
        has more complete knowledge and needs to affect the outcome of
        elections. For example, applications may choose to handle
        master selection, explicitly designating master and client
        sites. Applications in these cases may never need to call for
        an election. Alternatively, applications may choose to use
        <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a>'s arguments to force the correct outcome to an
        election. That is, if an application has three sites, A, B,
        and C, and after a failure of C determines that A must become
        the winner, the application can guarantee an election's
        outcome by specifying priorities appropriately after an
        election:
    </p>
      <pre class="programlisting">on A: priority 100, nsites 2
on B: priority 0, nsites 2</pre>
      <p>
        It is dangerous to configure more than one master
        environment using the <a href="../api_reference/C/repstart.html" class="olink">DB_ENV-&gt;rep_start()</a> method, and applications
        should be careful not to do so. Applications should only
        configure themselves as the master environment if they are the
        only possible master, or if they have won an election. An
        application knows it has won an election when it receives the
        <a href="../api_reference/C/envevent_notify.html#event_notify_DB_EVENT_REP_ELECTED" class="olink">DB_EVENT_REP_ELECTED</a> event.
    </p>
      <p>
        Normally, when a master failure is detected it is desired
        that an election finish quickly so the application can
        continue to service updates. Also, participating sites are
        already up and can participate. However, in the case of
        restarting a whole group after an administrative shutdown, it
        is possible that a slower booting site had later logs than any
        other site. To cover that case, an application would like to
        give the election more time to ensure all sites have a chance
        to participate. Since it is intractable for a starting site to
        determine which case the whole group is in, the use of a long
        timeout gives all sites a reasonable chance to participate. If
        an application wanting full participation sets the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a>
        method's <span class="bold"><strong>nvotes</strong></span> argument to
        the number of sites in the group and one site does not reboot,
        a master can never be elected without manual intervention.
    </p>
      <p>
        In those cases, the desired action at a group level is to
        hold a full election if all sites crashed and a majority
        election if a subset of sites crashed or rebooted. Since an
        individual site cannot know which number of votes to require,
        a mechanism is available to accomplish this using timeouts. By
        setting a long timeout (perhaps on the order of minutes) using
        the <span class="bold"><strong>DB_REP_FULL_ELECTION_TIMEOUT</strong></span> flag to the
        <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV-&gt;rep_set_timeout()</a> method, an application can allow Berkeley DB to
        elect a master even without full participation. Sites may also
        want to set a normal election timeout for majority based
        elections using the <span class="bold"><strong>DB_REP_ELECTION_TIMEOUT</strong></span>
        flag to the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV-&gt;rep_set_timeout()</a> method.
    </p>
      <p> 
        Consider 3 sites, A, B, and C where A is the master. In the
        case where all three sites crash and all reboot, all sites
        will set a timeout for a full election, say 10 minutes, but
        only require a majority for <span class="bold"><strong>nvotes</strong></span> to 
        the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a> method. Once all
        three sites are booted the election will complete immediately
        if they reboot within 10 minutes of each other. Consider if
        all three sites crash and only two reboot. The two sites will
        enter the election, but after the 10 minute timeout they will
        elect with the majority of two sites. Using the full election
        timeout sets a threshold for allowing a site to reboot and
        rejoin the group.
    </p>
      <p>
        To add a database environment to the replication group with
        the intent of it becoming the master, first add it as a
        client. Since it may be out-of-date with respect to the
        current master, allow it to update itself from the current
        master. Then, shut the current master down. Presumably, the
        added client will win the subsequent election. If the client
        does not win the election, it is likely that it was not given
        sufficient time to update itself with respect to the current
        master.
    </p>
      <p>
        If a client is unable to find a master or win an election,
        it means that the network has been partitioned and there are
        not enough environments participating in the election for one
        of the participants to win. In this case, the application
        should repeatedly call <a href="../api_reference/C/repstart.html" class="olink">DB_ENV-&gt;rep_start()</a> and <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a>, alternating
        between attempting to discover an existing master, and holding
        an election to declare a new one. In desperate circumstances,
        an application could simply declare itself the master by
        calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV-&gt;rep_start()</a>, or by reducing the number of participants
        required to win an election until the election is won. Neither
        of these solutions is recommended: in the case of a network
        partition, either of these choices can result in there being
        two masters in one replication group, and the databases in the
        environment might irretrievably diverge as they are modified
        in different ways by the masters.
    </p>
      <p>
        Note that this presents a special problem for a replication
        group consisting of only two environments. If a master site
        fails, the remaining client can never comprise a majority of
        sites in the group. If the client application can reach a
        remote network site, or some other external tie-breaker, it
        may be able to determine whether it is safe to declare itself
        master. Otherwise it must choose between providing
        availability of a writable master (at the risk of duplicate
        masters), or strict protection against duplicate masters (but
        no master when a failure occurs). Replication Manager offers
        this choice via the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV-&gt;rep_set_config()</a> method
        <a href="../api_reference/C/repconfig.html#config_DB_REPMGR_CONF_2SITE_STRICT" class="olink">DB_REPMGR_CONF_2SITE_STRICT</a> flag. Base API applications can
        accomplish this by judicious setting of the <span class="bold"><strong>nvotes</strong></span> and <span class="bold"><strong>nsites</strong></span> 
        parameters to the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a> method. 
    </p>
      <p>
        It is possible for a less-preferred database environment to
        win an election if a number of systems crash at the same time.
        Because an election winner is declared as soon as enough
        environments participate in the election, the environment on a
        slow booting but well-connected machine might lose to an
        environment on a badly connected but faster booting machine.
        In the case of a number of environments crashing at the same
        time (for example, a set of replicated servers in a single
        machine room), applications should bring the database
        environments on line as clients initially (which will allow
        them to process read queries immediately), and then hold an
        election after sufficient time has passed for the slower
        booting machines to catch up.
    </p>
      <p>
        If, for any reason, a less-preferred database environment
        becomes the master, it is possible to switch masters in a
        replicated environment. For example, the preferred master
        crashes, and one of the replication group clients becomes the
        group master. In order to restore the preferred master to
        master status, take the following steps:
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            The preferred master should reboot and re-join the
            replication group as a client.
        </li>
          <li>
            Once the preferred master has caught up with the
            replication group, the application on the current master
            should complete all active transactions and reconfigure
            itself as a client using the <a href="../api_reference/C/repstart.html" class="olink">DB_ENV-&gt;rep_start()</a> method.
        </li>
          <li>
            Then, the current or preferred master should call
            for an election using the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a> method.
        </li>
        </ol>
      </div>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="rep_mgr_ack.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="rep.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="rep_mastersync.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Choosing a Replication Manager acknowledgement policy </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Synchronizing with a
        master</td>
        </tr>
      </table>
    </div>
  </body>
</html>
